HAD Phase 4: trends_lin (Eq 17/18) + R-package end-to-end parity #392
Conversation
🤖 AI review — Overall Assessment: ⛔ Blocker. Review note: this was a static review only.
P0 (joint_pretrends_test consumed-placebo drop):
The detrending math collapses dy_t to mechanical zero at t = base-1
(the "consumed" placebo whose F-2 → F-1 evolution is used by the
slope estimator). Feeding that all-zero residual into
stute_joint_pretest tripped the exact-linear short-circuit and
returned a confidently-non-rejecting p_value=1.0 — a placebo that
was actually no placebo at all. Fix: compute base_minus_1_period
early (right after period_rank/base_rank), filter it out of
pre_periods before _aggregate_for_joint_test, emit UserWarning
when the filter fires, raise ValueError when nothing testable
remains. Mirrors HAD.fit's `e=-2` drop and R's "max placebo lag
reduces by 1 with the same effect" convention.
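A minimal sketch of that filter-and-warn step, assuming illustrative
names (`pre_periods`, `base_minus_1_period`) rather than the module's
actual internals:

```python
# Illustrative sketch only; names and call site are assumptions, not the
# package's actual implementation.
import warnings

def drop_consumed_placebo(pre_periods, base_minus_1_period):
    """Drop the placebo whose evolution the trends_lin slope already consumed."""
    testable = [t for t in pre_periods if t != base_minus_1_period]
    if len(testable) < len(pre_periods):
        warnings.warn(
            "trends_lin=True consumes the placebo at base_period - 1; "
            "dropping it from the joint pre-trends test.",
            UserWarning,
        )
    if not testable:
        raise ValueError("No testable placebo horizons remain under trends_lin=True.")
    return testable
```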
Regression tests (3):
- consumed placebo dropped + warning emitted when pre_periods
includes base_period - 1
- all-consumed case (pre_periods=[base-1]) raises ValueError
- no-consumed case (pre_periods excludes base-1) emits no warning
P1 (did_had_pretest_workflow trends_lin propagation):
Added trends_lin: bool = False kwarg to did_had_pretest_workflow,
forwarded to both joint_pretrends_test and joint_homogeneity_test
on the event-study path. Front-door rejects trends_lin=True with
aggregate="overall" via NotImplementedError (the overall path's
qug + stute + yatchew block has no joint-pretest surface).
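A hedged sketch of that dispatch; the helper name and callable-passing
style are illustrative, only the aggregate/trends_lin semantics come
from the change itself:

```python
# Illustrative dispatch sketch; not the workflow's literal code.
def dispatch_joint_pretests(aggregate, trends_lin, run_pretrends, run_homogeneity):
    if trends_lin and aggregate == "overall":
        raise NotImplementedError(
            "trends_lin=True requires aggregate='event_study'; the overall "
            "path (qug + stute + yatchew) has no joint-pretest surface."
        )
    # Event-study path: the same flag is forwarded to both joint tests.
    return run_pretrends(trends_lin=trends_lin), run_homogeneity(trends_lin=trends_lin)
```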
Regression tests (2):
- workflow(aggregate=event_study, trends_lin=True) bit-exactly
matches direct joint_pretrends_test + joint_homogeneity_test
calls
- workflow(aggregate=overall, trends_lin=True) raises
NotImplementedError
P2 (parity test exercises true overall path):
The R `overall_e1` combo was being routed through Python's
aggregate="event_study" with a single horizon. Now slices the
panel to two periods (F-1, F) and routes through
aggregate="overall" — exercising the actual two-period overall
code path that the changelog and registry claim parity for.
Result-extraction logic branches on combo_name to handle the
scalar HeterogeneousAdoptionDiDResults (overall) vs the array
HeterogeneousAdoptionDiDEventStudyResults (event_study).
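Roughly what the re-routed parity cell now does; the column name, the
period variables, and the positional data argument to `fit` are
assumptions for illustration, and `HeterogeneousAdoptionDiD` is assumed
to be imported already:

```python
# Illustrative routing sketch for the overall_e1 combo.
def run_overall_e1(panel, period_col, f_minus_1, f_period):
    two_period = panel[panel[period_col].isin([f_minus_1, f_period])]
    est = HeterogeneousAdoptionDiD(design="continuous_at_zero")
    res = est.fit(two_period, aggregate="overall")  # scalar HeterogeneousAdoptionDiDResults
    return res.att, res.se, res.conf_int_low, res.conf_int_high
```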
Stats: 534 tests pass (529 prior + 5 new R1 regressions), 0
regressions. Parity tolerances unchanged: atol=1e-8 (point/SE/CI),
atol=1e-10 (Yatchew T-stat after N/(N-1) shift).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber). Static review only.
P1 (workflow consumed-placebo handling):
did_had_pretest_workflow(aggregate="event_study", trends_lin=True)
computed `earlier_pre = t_pre_list[:-1]` and dispatched to
joint_pretrends_test BEFORE accounting for the consumed-placebo drop
that joint_pretrends_test applies internally. On a minimal valid panel
(4 periods, F=3, base=2, earlier_pre=[1]=base-1=consumed), the
workflow propagated joint_pretrends_test's "no testable placebo
horizons remain" ValueError instead of taking the existing "no earlier
placebo → pretrends_joint=None, skip step 2" verdict path.
Fix: pre-filter t_pre_list[-2] (= base_period-1) out of earlier_pre
when trends_lin=True. If empty, the existing `len(earlier_pre) >= 1`
guard sets pretrends_joint=None and step 3 (homogeneity) still runs.
Regression test: minimal 4-period F=3 panel, trends_lin=True, asserts
workflow succeeds, pretrends_joint is None, homogeneity_joint is
populated with finite p_value.
P3 (parameter docstrings):
Added trends_lin parameter documentation to the four affected public
surfaces:
- HeterogeneousAdoptionDiD.fit
- joint_pretrends_test
- joint_homogeneity_test
- did_had_pretest_workflow
Each docstring covers: the formula, R-package mirror, requirements
(F>=3 / base-1 in panel), the consumed-placebo behavior (auto-drop +
UserWarning on direct surfaces; skip-step-2 escape on workflow), the
survey-weighting NotImplementedError contract, and bit-exact default
backcompat.
Stats: 535 tests pass (534 prior + 1 new R2 P1 regression), 0
regressions. Parity tolerances unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber). Static review only.
P1 (non-terminal-base methodology guard):
joint_pretrends_test(trends_lin=True) and joint_homogeneity_test(
trends_lin=True) silently accepted base_period values other than the
last validated pre-period (= F-1). Paper Eq 17 / Eq 18 specifically
anchor the detrending at F-1 with slope Y[F-1] - Y[F-2]; non-
terminal anchors compute a different slope at a different anchor —
silently changing the methodology away from the documented Eq
17/18 / R DIDHAD::did_had(trends_lin=TRUE) construction.
Fix: under trends_lin=True, require base_period == t_pre_list[-1]
inside the validator block. The workflow and HAD.fit always pass
F-1, so they're unaffected. Direct callers passing a non-terminal
base get a clear ValueError pointing at t_pre_list[-1].
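The guard is essentially a one-line check inside the validator; a
hedged sketch with illustrative names:

```python
# Illustrative guard sketch; the real validator structure may differ.
def require_terminal_base(base_period, t_pre_list, trends_lin):
    if trends_lin and base_period != t_pre_list[-1]:
        raise ValueError(
            f"trends_lin=True requires base_period == {t_pre_list[-1]!r} "
            f"(the last validated pre-period, F-1, per Eq 17/18); got {base_period!r}."
        )
```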
Regression tests (2):
- joint_pretrends_test(base=F-2, trends_lin=True) raises
- joint_homogeneity_test(base=F-2, trends_lin=True) raises
P1 (ordered-categorical observed-period base-1 lookup):
The previous base_period - 1 lookup walked period_rank (= the full
ordered-categorical level list) decrementing by rank. On panels
with unused intermediate categorical levels (e.g. dtype levels
[t1, t2, t_unused, t3, t4] with only t1..t4 observed), base=t3
under trends_lin would resolve base-1 to t_unused (rank 2), and
the slope-pivot lookup would KeyError because t_unused is not in
the data.
Fix: derive base_minus_1 from the validator's t_pre_list[-2] —
the validator builds t_pre_list from observed contiguous pre-
periods only, so it correctly skips unused categorical levels.
Both joint wrappers and the workflow now use the same observed-
period predecessor, eliminating the failure mode.
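A hedged sketch of the observed-period predecessor derivation (names
illustrative):

```python
# Illustrative sketch: derive base-1 from observed pre-periods, not dtype levels.
def observed_base_minus_1(t_pre_list):
    if len(t_pre_list) < 2:
        raise ValueError(
            "trends_lin=True needs at least two observed pre-periods (F >= 3)."
        )
    return t_pre_list[-2]  # skips unused categorical levels by construction
```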
Restructure: moved the trends_lin consumed-placebo drop from
before the validator block into it (so t_pre_list is in scope).
Added an early `n_periods < 3` gate for trends_lin so 2-period
panels get a clean error instead of silently failing downstream.
Regression test:
- joint_pretrends_test on an ordered-categorical panel with an
unused intermediate level succeeds (no KeyError on slope pivot)
Stats: 538 tests pass (535 prior + 3 new R3 P1 regressions), 0
regressions. Parity tolerances unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber). Overall Assessment: ⛔ Blocker (static review only). Code Quality, Performance, Maintainability, Security: no findings.
P0 (observed-period detrending delta):
The R3 P1 fix made the slope LOOKUP use observed periods, but the
detrending DELTA (`t_rank - base_rank`) still pulled ranks from the
full categorical dtype via _build_period_rank. On panels with
unused intermediate categorical levels, the same observed data
produced different (t - base) multipliers and corrupted the
joint test statistic — silent wrong statistical output.
Fix: under trends_lin=True, build a fresh observed_rank dict from
sorted(set(data_filtered[time_col].unique())) and use it for both
the base and horizon ranks in the delta computation. Mirrors
HAD.fit's `_aggregate_multi_period_first_differences` convention
(`sorted(t_pre_list + t_post_list, ...)` for the event-time rank).
Both joint wrappers fixed; workflow inherits the fix automatically.
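A hedged sketch of the observed-rank delta; the variable names
(`data_filtered`, `time_col`) follow the description above, but the
surrounding code is illustrative:

```python
# Illustrative sketch of the delta fix; not the module's literal code.
def detrended_residual(dy_t, t, base_period, slope, data_filtered, time_col):
    observed_periods = sorted(set(data_filtered[time_col].unique()))
    observed_rank = {p: i for i, p in enumerate(observed_periods)}
    delta = observed_rank[t] - observed_rank[base_period]  # (t - base) multiplier
    return dy_t - delta * slope
```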
Regression tests (2):
- joint_pretrends_test on (categorical with 2 unused levels)
produces identical cvm_stat_joint and p_value to (categorical
without unused levels) on the same observed data
- joint_homogeneity_test twin invariant (unused level between
base and post)
P3 (exact upstream-version pin):
The parity contract cited DIDHAD v2.0.0 / SHA edc09197 in
CHANGELOG, REGISTRY, and the parity test docstrings, but the
generator and test only enforced `>= 2.0.0`. Future regeneration
could silently re-anchor goldens to a newer release while docs
still cited the old version.
Fix: pin exactly DIDHAD == 2.0.0 and YatchewTest == 1.1.1 in both
the generator's stopifnot guards and the parity test's metadata
assertion. Document the bump procedure in the comments.
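A hedged Python-side sketch of the exact-pin assertion; the metadata
key names are hypothetical, only the pinned versions come from the
text above:

```python
# Illustrative exact-pin check; fixture key names are assumptions.
PINNED_R_PACKAGES = {"DIDHAD": "2.0.0", "YatchewTest": "1.1.1"}

def assert_exact_pins(golden_metadata):
    for pkg, want in PINNED_R_PACKAGES.items():
        got = golden_metadata["package_versions"][pkg]
        assert got == want, (
            f"{pkg} golden generated with {got}, expected exactly {want}; "
            "bump the generator stopifnot, this pin, and the docs together."
        )
```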
Stats: 540 tests pass (538 prior + 2 new R4 P0 regressions), 0
regressions. All 24 R-parity cells still green at atol=1e-8 /
1e-10. Note: existing categorical-level invariance tests added in
R3 still pass — they exercised correctness on simple unused-level
shifts; the R4 invariants are stricter, asserting bit-exact
identity of the joint statistics across categorical re-ordering of
the same observed panel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber).
R5 was ✅ Looks good — only P3 polish remained. All addressed:
P3 #1 — exact-pin nprobust: The parity contract runs through nprobust
numerical paths (DIDHAD's local-linear bandwidth + bias-correction
calls), so a fresh regeneration could drift if CRAN serves a newer
nprobust. Pin nprobust == 0.5.0 in both the R generator's stopifnot
guard and the parity test's metadata assertion alongside DIDHAD and
YatchewTest.
P3 #2 — workflow docstring: did_had_pretest_workflow's top-level
docstring still said "Eq 18 linear-trend detrending is a Phase 4
follow-up", which contradicts the shipped trends_lin behavior.
Updated to describe the forwarding contract (trends_lin →
joint_pretrends_test + joint_homogeneity_test, consumed-placebo skip
path on minimal panels). Same fix on the StuteJointResult class
docstring.
P3 #3 — parity test horizon-shape assertions: Added an explicit
"missing in Python" assertion in _zip_r_python: every R-mapped event
time must be present in Python's event_times (catches future
horizon-shape regressions where Python silently drops a horizon R
requested). Added an effects+placebo row-count sanity check in
test_yatchew_t_stat_parity (uses the previously-unused
effects/placebo parametrize values to catch fixture drift).
Stats: 540 tests pass, 0 regressions. No estimator/methodology
changes — all P3 polish.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber). Overall Assessment: ✅ Looks good — no unmitigated P0/P1 findings.
R6 was ✅ Looks good — 2 P3 polish items.
P3 #1 — version-aware repro installer: benchmarks/R/requirements.R
installed whatever CRAN currently served via install.packages, while
the generator and parity test hard-pin DIDHAD == 2.0.0 /
YatchewTest == 1.1.1 / nprobust == 0.5.0. A fresh R environment
regenerating the goldens would have the generator's
stopifnot(packageVersion == "X.Y.Z") immediately abort. Fix: add
`install_pinned_version()` helper using remotes::install_version with
`upgrade = "never"`, run it after the bulk CRAN install for
DIDHAD/YatchewTest/nprobust. Idempotent when the correct version is
already installed. Bump procedure documented in lockstep with the
generator + parity-test pins.
P3 #2 — exact-set parity event_times: _zip_r_python() previously
asserted only that R-mapped horizons were a SUBSET of Python's
event_times (missing-in-python check). Tighten to FULL SET EQUALITY:
also reject horizons present in Python but absent from R's requested
set ("extra_in_python"). This catches future event_study
horizon-selection regressions in both directions — e.g. if our
effects/placebo cap drifts and Python emits an extra row R didn't
request.
Stats: 540 tests pass, 0 regressions. Still no estimator changes —
all P3 polish on the parity / repro infrastructure.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
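A hedged sketch of the tightened horizon check from P3 #2 above; the
function body is illustrative, only the two failure labels come from
the description:

```python
# Illustrative set-equality check; not _zip_r_python's literal code.
def assert_horizon_set_equality(r_event_times, py_event_times):
    missing_in_python = sorted(set(r_event_times) - set(py_event_times))
    extra_in_python = sorted(set(py_event_times) - set(r_event_times))
    assert not missing_in_python, f"missing_in_python: {missing_in_python}"
    assert not extra_in_python, f"extra_in_python: {extra_in_python}"
```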
/ai-review
🔁 AI review rerun (requested by @igerber). Overall Assessment: ✅ Looks good — no unmitigated P0/P1 findings.
R7 was ✅ Looks good — only 1 P3 docstring nit.
The trends_lin parameter docstrings on joint_pretrends_test and
joint_homogeneity_test only documented the "base_period - 1 must
exist in panel" requirement, but the actual public contract is
stricter: direct callers must also pass `base_period == F-1` (the
last validated pre-period). The PR #392 R3 P1 guard enforces this in
code; the docstring just understated it. Updated both docstrings to
state explicitly that base_period must be the canonical F-1 anchor,
not just any pre-period — and that non-terminal anchors raise
ValueError per Eq 17 / R semantics.
Stats: 540 tests pass, 0 regressions. Doc-only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber). Overall Assessment: ✅ Looks good — no unmitigated P0/P1 findings. Execution note: static review only; the targeted tests could not be run in this container.
R8 was ✅ Looks good — only 1 P3 doc nit.
The CHANGELOG and REGISTRY parity claim said "HAD R-package
end-to-end parity" without qualifying that the harness explicitly
forces `HeterogeneousAdoptionDiD(design="continuous_at_zero")`.
R `did_had` always evaluates the local-linear at d=0 regardless of
dose distribution; our default `design="auto"` may legitimately
resolve to `continuous_near_d_lower` or `mass_point` on dose
distributions with boundary density bounded away from zero (e.g.
Beta(2,2) at G=200), in which case the WAS estimand evaluates at a
different point and diverges from R numerically. That divergence is
methodologically defensible — our auto-detect uses more information
when boundary mass is sparse — but it means the parity test does NOT
validate the default `design="auto"` surface, only the Design 1'
surface that R also uses. Updated both wording surfaces to qualify
"on the `design='continuous_at_zero'` (Design 1') surface" and
explain the auto-detect divergence as out-of-scope-for-this-test
(not a defect).
Stats: 540 tests pass, 0 regressions. Doc-only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber). Overall Assessment: ✅ Looks good — no unmitigated P0/P1 findings.
Adds `trends_lin: bool = False` keyword-only kwarg to
`HeterogeneousAdoptionDiD.fit(aggregate="event_study")`,
`joint_pretrends_test`, and `joint_homogeneity_test`. Mirrors R
`DIDHAD::did_had(..., trends_lin=TRUE)` (paper Eq 17 / Eq 18 / page 32
joint-Stute homogeneity-with-trends). Per-group linear-trend slope
estimated as `Y[g, F-1] - Y[g, F-2]`; applied uniformly as
`(t - base) × slope` adjustment that absorbs both the effect-side
detrending and the placebo-side anchor swap. Requires F ≥ 3 (panel
must contain F-2). The "consumed" placebo at our event-time `e=-2` is
auto-dropped (R reduces max placebo lag by 1 with the same effect).
Mutually exclusive with survey weighting; raises NotImplementedError.
Bit-exact backcompat for trends_lin=False (default) across all
existing surfaces.
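A hedged numeric sketch of that adjustment on a toy 4-period panel,
assuming `dy_t` denotes the group-level contrast `Y[t] - Y[base]`
(toy values and layout are illustrative):

```python
# Toy illustration of the Eq 17/18-style detrending; not the estimator's code.
import numpy as np

y = {0: 1.0, 1: 1.5, 2: 2.0, 3: 3.1}    # per-group means; F = 3, base = F-1 = 2
base = 2
slope = y[base] - y[base - 1]             # Y[g, F-1] - Y[g, F-2] = 0.5

def detrended_contrast(t):
    dy_t = y[t] - y[base]                 # raw contrast against the base period
    return dy_t - (t - base) * slope      # uniform (t - base) x slope adjustment

assert np.isclose(detrended_contrast(base - 1), 0.0)  # "consumed" placebo at e = -2
print(detrended_contrast(3))              # post-period contrast net of the trend: 0.6
```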
Adds end-to-end R-package parity test vs `DIDHAD` v2.0.0
(Credible-Answers/did_had, SHA `edc09197`). New generator
`benchmarks/R/generate_did_had_golden.R` produces
`benchmarks/data/did_had_golden.json` covering 3 paper-derived
synthetic DGPs (Uniform, Beta(2,2), Beta(0.5,1)) × 5 method
combinations (overall, event-study, placebo, yatchew, trends_lin).
`tests/test_did_had_parity.py` asserts:
- point estimate / SE / CI bounds at atol=1e-8
- closed-form Yatchew T-stat at atol=1e-10 after a documented
`× G/(G-1)` finite-sample convention shift
on all 24 (DGP × combo) cells.
Two intentional convention deviations from R, documented in
REGISTRY.md and the parity test docstring:
(a) we report the bias-corrected point estimate (modern CCF 2018
convention); R's `Estimate` reports the conventional estimate
with the bias-corrected CI separately. Our `att` matches R's
CI midpoint.
(b) Yatchew uses paper Appendix E's literal (1/G) variance
denominator; R uses base-R `var()`'s (1/(N-1)) sample-variance
convention. Ratio is exactly N/(N-1); both converge to the same
asymptotic null.
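A quick numeric check of the stated ratio between the two variance
conventions (illustrative only, not the package's Yatchew
implementation):

```python
import numpy as np

x = np.random.default_rng(0).normal(size=50)
pop_var = np.var(x)               # 1/G denominator (paper Appendix E convention)
sample_var = np.var(x, ddof=1)    # 1/(N-1) denominator (base-R var() convention)
assert np.isclose(sample_var / pop_var, 50 / 49)  # exactly N/(N-1)
```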
Yatchew on placebos with R's mean-independence null (`order=0`) is
not exposed in our `yatchew_hr_test` and is skipped in the parity
test; tracked as TODO follow-up.
Stats:
- 16 new direct trends_lin unit tests in test_had_pretests.py
- 24 new R-parity tests in test_did_had_parity.py
- 489 baseline + 16 + 24 = 529 tests pass, 0 regressions
- Generator script: ~280 LoC; fixture: 117 KB
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed 476cc32 to fc436a5 (Compare)
/ai-review
🔁 AI review rerun (requested by @igerber). Overall Assessment: ✅ Looks good — no unmitigated P0/P1 findings.
The HeterogeneousAdoptionDiD example snippet in choosing_estimator.rst
failed the Pure Python Fallback CI job
(test_doc_snippets.py::test_doc_snippet[choosing_estimator:block7])
due to three latent drift bugs from PR #389 (docs-rtd-audit):
1. Missing aggregate='event_study' on both did_had_pretest_workflow
   and HAD.fit calls — default aggregate='overall' requires exactly
   2 periods, but the doc-snippet test framework's namespace `data`
   (built via generate_staggered_data) has 10 periods.
2. Used the namespace's generic `data` variable, which has nonzero
   dose in every period (rng.choice from {0.0, 0.5, 1.0, 2.0}). HAD
   requires D=0 for all units in at least one pre-period.
3. `print(f"Estimate: {results.att:.3f}")` formatted att as a scalar,
   but under aggregate='event_study' results.att is a numpy array.
Fix: rewrite the snippet to construct its own HAD-shape panel inline
(mirrors how block6 handles ContinuousDiD with its own data
generator); thread aggregate='event_study' through both calls;
iterate the per-horizon att array for output.
Pre-existing on origin/main; surfaced on this PR's CI re-run after
the rebase. Other failing snippets (troubleshooting:block18,
:block20, r_comparison:block6, :block7) are also pre-existing on main
but are out of scope for this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
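The output part of the fix looks roughly like this; the attribute
names `event_times` and `att` follow the event-study results
described elsewhere in this PR, the rest is illustrative:

```python
# Illustrative sketch of the corrected snippet's output step.
def print_event_study_atts(results):
    # Under aggregate="event_study", results.att is a per-horizon array,
    # so iterate rather than formatting a scalar.
    for e, att_e in zip(results.event_times, results.att):
        print(f"e={e}: ATT={att_e:.3f}")
```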
Mirrors R YatchewTest::yatchew_test(order=0). Closes the placebo
Yatchew R-parity gap from PR #392.
- New keyword-only `null: Literal["linearity", "mean_independence"]`
  on `yatchew_hr_test` (default `"linearity"` is bit-exact backcompat).
- `"mean_independence"` fits intercept-only OLS (residuals =
  dy - mean(dy)); the downstream sigma2_diff / sigma2_W / sort-by-d
  machinery is shared.
- Wired through both unweighted and survey-weighted code paths (4-arm
  dispatch on (null × weighted)).
- `YatchewTestResults` gained `null_form: str = "linearity"` field;
  `summary()` renders the correct null-hypothesis title; `__repr__`
  and `to_dict()` updated.
- `tests/test_did_had_parity.py::TestYatchewParity` removed the
  placebo skip; routes effect rows through `null="linearity"`
  (R order=1) and placebo rows through `null="mean_independence"`
  (R order=0); both modes share the documented `× G/(G-1)`
  finite-sample convention shift and parity holds at `atol=1e-10`.
- New `TestYatchewHRTestMeanIndependence` class (15 tests) covering
  happy path, naive Python baseline at `atol=1e-12`,
  population-variance closed form, invalid value, default backcompat,
  mode-agnostic tie/constant-d rejection, NaN handling, weighted
  reduction at w=ones(G) at `atol=1e-14`, weighted non-uniform
  baseline, default-under-weights, survey×null orthogonality, the
  (linearity, weighted) baseline (4-arm coverage), zero/replicate-
  weight rejection, and G<3 mode-agnostic. One additive backcompat
  case in each of `TestYatchewHRTest` and `TestYatchewHRTestSurvey`.
- REGISTRY.md HAD § Yatchew note: TODO marker replaced with shipped
  description. CHANGELOG.md and TODO.md updated.
Patch-level (additive keyword-only kwarg + additive dataclass field
with default).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
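A hedged sketch of the null-dependent residual step; the helper name
is hypothetical, and the downstream sigma2_diff / sigma2_W machinery
is unchanged and not reproduced here:

```python
# Illustrative residual step for the two nulls; not the package's internals.
import numpy as np

def null_residuals(dy, d, null="linearity"):
    if null == "linearity":
        # OLS of dy on [1, d]; residuals feed the shared differencing machinery.
        X = np.column_stack([np.ones_like(d), d])
        beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
        return dy - X @ beta
    if null == "mean_independence":
        # Intercept-only fit (R order=0): residuals are just demeaned dy.
        return dy - dy.mean()
    raise ValueError(f"unknown null: {null!r}")
```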
Summary
- `trends_lin: bool = False` keyword-only kwarg on
  `HeterogeneousAdoptionDiD.fit(aggregate="event_study")`,
  `joint_pretrends_test`, `joint_homogeneity_test`. Mirrors R
  `DIDHAD::did_had(..., trends_lin=TRUE)` (paper Eq 17 / Eq 18 /
  page 32 joint-Stute-with-trends).
- `DIDHAD` v2.0.0 (Credible-Answers/did_had, SHA `edc09197`): 24
  parity tests across 3 paper-derived synthetic DGPs (Uniform,
  Beta(2,2), Beta(0.5,1)) × 5 method combinations (overall,
  event-study, placebo, yatchew, trends_lin).

Tolerances achieved
- `atol=1e-8` (full pipeline through `nprobust` numerical paths).
- `atol=1e-10` after a documented `× G/(G-1)` finite-sample
  convention shift.

Two intentional convention deviations from R (documented in REGISTRY.md)
(a) Point estimate convention: we report the bias-corrected point
    estimate `att = (mean(ΔY) - tau.bc) / mean(D)` (modern CCF 2018
    convention); R's `Estimate` column reports the conventional
    `(mean(ΔY) - tau.us) / mean(D)` with the bias-corrected CI
    separately. Our `att` matches R's CI midpoint; our
    `se`/`conf_int_low`/`conf_int_high` match R's `se`/`ci_lo`/`ci_hi`
    directly.
(b) Yatchew variance denominator: our `yatchew_hr_test` follows paper
    Appendix E literally with the `(1/G)` and `(1/(2G))`
    population-variance denominators. R's `YatchewTest::yatchew_test`
    uses base-R `var()`'s `(1/(N-1))` sample-variance convention.
    Ratio is exactly `N/(N-1)`; both converge to the same asymptotic
    null distribution.

Feature gap (TODO follow-up)
`yatchew_hr_test` always fits `Y ~ D` (linearity null). R's
`YatchewTest` (`order=0`) fits `Y ~ 1` (intercept-only,
mean-independence null), which is what `DIDHAD` uses on placebo rows
("non-parametric pre-trends test" per the README). Adding a
`null="mean_independence"` mode to `yatchew_hr_test` would close the
placebo Yatchew parity gap; tracked in TODO.md.

Test plan
- `tests/test_had_pretests.py` (`::TestJointPretestsTrendsLin`,
  `::TestHADFitTrendsLin`): F<3 guard, naive-Python detrending
  baseline at `atol=1e-12`, survey_design × trends_lin
  `NotImplementedError`, get/set_params idempotence, consumed-placebo
  (e=-2) drop invariant.
- `tests/test_did_had_parity.py` (`::TestPointSEParity`,
  `::TestYatchewParity`, `::TestFixtureMetadata`).
- `trends_lin=False` (default) preserves all 489 baseline test
  outputs unchanged.
- `Rscript benchmarks/R/generate_did_had_golden.R` writes
  `benchmarks/data/did_had_golden.json` (117 KB committed).

🤖 Generated with Claude Code